
    QuickCast: Fast and Efficient Inter-Datacenter Transfers using Forwarding Tree Cohorts

    Large inter-datacenter transfers are crucial for cloud service efficiency and are increasingly used by organizations that have dedicated wide area networks between datacenters. A recent work uses multicast forwarding trees to reduce the bandwidth needs and improve completion times of point-to-multipoint transfers. Using a single forwarding tree per transfer, however, leads to poor performance because the slowest receiver dictates the completion time for all receivers. Using multiple forwarding trees per transfer alleviates this concern: the average receiver could finish early. However, if done naively, bandwidth usage would also increase, and it is a priori unclear how best to partition receivers, how to construct the multiple trees, and how to determine the rate and schedule of flows on these trees. This paper presents QuickCast, a first solution to these problems. Using simulations on real-world network topologies, we see that QuickCast can speed up the average receiver's completion time by as much as 10× while using only 1.04× more bandwidth; further, the completion time for all receivers also improves by as much as 1.6× at high loads. Comment: [Extended Version] Accepted for presentation in IEEE INFOCOM 2018, Honolulu, HI
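
    A minimal sketch of the receiver-partitioning idea (purely illustrative; the grouping rule, the tree-construction helper, and all names below are assumptions, not QuickCast's actual algorithm):

```python
# Partition receivers into cohorts with similar estimated rates and build one
# forwarding tree per cohort, so slow receivers no longer hold back fast ones.
def partition_receivers(receivers, est_rate, num_groups=2):
    """Group receivers with similar estimated download rates."""
    ranked = sorted(receivers, key=est_rate, reverse=True)
    size = max(1, len(ranked) // num_groups)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

def schedule_transfer(source, receivers, est_rate, build_tree):
    """Return (tree, cohort, rate) tuples; each tree is paced by the slowest
    receiver inside its own cohort only."""
    trees = []
    for cohort in partition_receivers(receivers, est_rate):
        tree = build_tree(source, cohort)        # e.g. a Steiner-tree heuristic
        rate = min(est_rate(r) for r in cohort)  # cohort-local bottleneck
        trees.append((tree, cohort, rate))
    return trees
```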

    DCCast: Efficient Point to Multipoint Transfers Across Datacenters

    Using multiple datacenters allows for higher availability, load balancing and reduced latency to customers of cloud services. To distribute multiple copies of data, cloud providers depend on inter-datacenter WANs that ought to be used efficiently considering their limited capacity and the ever-increasing data demands. In this paper, we focus on applications that transfer objects from one datacenter to several datacenters over dedicated inter-datacenter networks. We present DCCast, a centralized Point to Multi-Point (P2MP) algorithm that uses forwarding trees to efficiently deliver an object from a source datacenter to required destination datacenters. With low computational overhead, DCCast selects forwarding trees that minimize bandwidth usage and balance load across all links. With simulation experiments on Google's GScale network, we show that DCCast can reduce total bandwidth usage and tail Transfer Completion Times (TCT) by up to 50% compared to delivering the same objects via independent point-to-point (P2P) transfers. Comment: 9th USENIX Workshop on Hot Topics in Cloud Computing, https://www.usenix.org/conference/hotcloud17/program/presentation/noormohammadpou
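
    A rough sketch of load-aware forwarding-tree selection in the spirit of the description above (the greedy shortest-path heuristic and the networkx-based representation are assumptions, not DCCast's exact method):

```python
import networkx as nx

def select_forwarding_tree(graph, source, destinations, transfer_size):
    """Pick a low-weight tree spanning source and destinations, where links
    that already carry more scheduled load are more expensive."""
    for _, _, data in graph.edges(data=True):
        data["weight"] = data.get("load", 0.0) + transfer_size
    tree = nx.Graph()
    for dst in destinations:
        # Union of weighted shortest paths: a rough Steiner-tree approximation
        # (a real implementation would prune any cycles this creates).
        path = nx.shortest_path(graph, source, dst, weight="weight")
        nx.add_path(tree, path)
    for u, v in tree.edges():
        # Record the new load so later transfers steer away from these links.
        graph[u][v]["load"] = graph[u][v].get("load", 0.0) + transfer_size
    return tree
```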

    Increasing the robustness of networked systems

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Includes bibliographical references (p. 133-143). What popular news do you recall about networked systems? You've probably heard about the several-hour failure at Amazon's computing utility that took many startups offline, or the attacks that forced the Estonian government websites to be inaccessible for several days, or you may have observed inexplicably slow responses or errors from your favorite website. Needless to say, keeping networked systems robust to attacks and failures is an increasingly significant problem. Why is it hard to keep networked systems robust? We believe that uncontrollable inputs and complex dependencies are the two main reasons. The owner of a website has little control over when users arrive; the operator of an ISP has little say in when a fiber gets cut; and the administrator of a campus network is unlikely to know exactly which switches or file servers may be causing a user's sluggish performance. Despite unpredictable or malicious inputs and complex dependencies, we would like a network to manage itself, i.e., diagnose its own faults and continue to maintain good performance. This dissertation presents a generic approach to harden networked systems by distinguishing between two scenarios. For systems that need to respond rapidly to unpredictable inputs, we design online solutions that re-optimize resource allocation as inputs change. For systems that need to diagnose the root cause of a problem in the presence of complex subsystem dependencies, we devise techniques to infer these dependencies from packet traces and build functional representations that facilitate reasoning about the most likely causes for faults. We present a few solutions, as examples of this approach, that tackle an important class of network failures. Specifically, we address (1) re-routing traffic around congestion when traffic spikes or links fail in internet service provider networks, (2) protecting websites from denial of service attacks that mimic legitimate users, and (3) diagnosing causes of performance problems in enterprise and campus-wide networks. Through a combination of implementations, simulations and deployments, we show that our solutions advance the state of the art. by Srikanth Kandula. Ph.D.
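
    A small illustrative sketch of one ingredient mentioned above, inferring likely service dependencies from packet traces via temporal co-occurrence (the window, data layout, and scoring are assumptions, not the dissertation's algorithm):

```python
from collections import defaultdict

def infer_dependencies(trace, window=0.05):
    """trace: time-sorted (timestamp_sec, client, service) tuples.
    Counts how often a flow to one service is closely followed by a flow to
    another from the same client; frequent pairs suggest a dependency."""
    deps = defaultdict(int)
    recent = defaultdict(list)  # client -> [(time, service), ...]
    for t, client, service in trace:
        for t_prev, prev in recent[client]:
            if t - t_prev <= window and prev != service:
                deps[(prev, service)] += 1
        recent[client] = [(tp, s) for tp, s in recent[client]
                          if t - tp <= window] + [(t, service)]
    return deps
```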

    Beyond SynFloods: Guarding Web Server Resources from DDoS Attacks

    Problem. Denial-of-Service attacks on web servers take many forms. In this paper, we look at a new breed of application-level attacks. An attacker compromises a large number of dummy clients (by means of a worm, virus or Trojan horse) and causes the clients to flood the web server with well-formed HTTP requests that download large files or generate complex database queries. Such requests cause the web server to expend costly server resources like sockets, disk bandwidth, database sub-system bandwidth and worker processes on these dummy users. As a result, performance seen by legitimate users will degrade, eventually leading to denial of service. These attacks are hard to counter as the malicious requests are indistinguishable from legitimate requests at the server. Further, the dummy requests arrive from a large number of geographically distributed machines; thus, they cannot be filtered on source IP addresses or arrival patterns. Prior work has looked at network/transport-level DDoS attacks such as SYN floods and bandwidth attacks.

    Approach. Despite the distributed nature of clients participating in a DDoS attack, typically a small group of human operators initiates and manages the attack. By requiring the clients to interact with their human operator before they access server resources, we limit the speed of the DDoS attack and make the human attacker a shared bottleneck. In our system, a web server can be in either of two modes, NORMAL and UNDER ATTACK. The server behavior is unchanged in NORMAL mode. When the web server perceives resource depletion beyond an acceptable limit, it shifts to the UNDER ATTACK mode. In this mode, the server continues to serve connections that were established during the NORMAL mode. The server asks new clients to solve a puzzle that is easy to solve by a human but difficult to compute by a machine, before providing access to the system. Depending upon the desired level of protection, the puzzle could be a variation on the following text: "We are suspecting a DDoS attack on Foo. To access Foo, type in the text box … after replacing the number 6 by 2", or a URL embedded in an image (a CAPTCHA). One concern is that the user might not be willing to solve the puzzle. In this case, our system behaves like current systems, which handle these attacks by asking the user to "come back later". The user can still choose to ignore the puzzle and "come back later"; solving the puzzle grants the user immediate access to the server.

    Challenges. Incorporating a human in the loop has been used to counter automated user account creation and e-mail spam. However, using this approach to prevent a DDoS attack on web server resources is different due to the following challenges.
    1. The puzzle should be sent and validated without allocating any TCBs or sockets at the server, while ensuring correct TCP congestion control semantics.
    2. The client's TCP stack and the browser should not be modified.
    3. The mechanism should be transparent to web caches.
    4. A normal user would have to manually enter the key just once per browsing session, potentially consisting of multiple TCP connections.
    5. If the system is experiencing a flash crowd rather than a DDoS attack, the mechanism should be benign.
    6. Validation should be independent of the source IP address, as malicious users could share an IP with ordinary users due to NAT or spoofing.
    7. One puzzle allows access to only one client, so it is useless for an attacker to solve a puzzle and distribute it to a large number of worms.
    8. Switching from NORMAL to UNDER ATTACK mode (and vice versa) should be inexpensive and transparent to ongoing sessions.
    9. The mechanism should work when requests are handled by a server farm.

    Ongoing Implementation. We are working on an implementation running Apache on Linux to address the above challenges, some aspects of which are discussed below. Currently, we use CAPTCHA-like images (1-2 pkts) as puzzles but are experimenting with natural-language puzzles, which are smaller in size. The puzzle is returned in an HTML form. To solve the puzzle, a human user types the answer (key) and submits the form, creating an HTTP request containing GET /validate?answer=KEY. On a new connection request (i.e., SYNs to the web server), we want to send a puzzle and validate the key without allocating any TCBs or sockets at the server. The server responds to SYN packets with a SYN cookie. The client receives the SYN cookie, increases its congestion window to two packets, and transmits a SYN-ACK-ACK and the first data packet, which usually contains the HTTP request. The kernel at the server end does not create a new socket upon completion of the TCP handshake; instead, the SYN-ACK-ACK packet is discarded. When the server receives the client's data packet, if the header of the HTTP request is not of the form GET /validate?answer=KEY, then this packet begins an HTTP session and is not an attempt at validating a key. The server replies with a new puzzle (1-2 pkts) as the HTTP response and immediately resets the connection (using the TCP RST flag). Otherwise, the kernel checks the cryptographic validity of the key. If the check succeeds, a socket is established and the request is delivered to the application. Note that this scheme preserves TCP congestion control semantics and prevents attacks that hog TCBs and sockets by establishing connections that exchange no data. The above scheme creates the following per-session overhead when the server is in UNDER ATTACK mode: two hashes to validate the answer, a few memory accesses to look at HTTP headers, and fetching a puzzle and sending it to the client. To ensure that a user needs to solve a puzzle only once even if the session contains multiple HTTP 1.0 connections, the server uses a cookie at the client. Again, note that the attacker cannot mount an attack by replicating a cookie, because each cookie is mapped to a single key and the server constrains the number of connections using the same key to be small (e.g., four). Assuming that worms are not equipped with OCR software or natural language parsers, the rate at which malicious clients gain access is equal to the rate at which the human operators solve puzzles. To prevent attackers from distributing one puzzle's answer to a herd of clients, the server constrains the number of active TCP connections per key.
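
    A rough user-space sketch of the UNDER ATTACK admission logic described above (the real mechanism lives in the kernel's TCP path with SYN cookies; the key format, hash scheme, helper names, and limits here are assumptions):

```python
import hashlib
import hmac

SERVER_SECRET = b"example-secret"   # hypothetical per-server secret
MAX_CONNS_PER_KEY = 4               # small bound on connections per key
active_conns_per_key = {}

def key_is_valid(key):
    """Cheap cryptographic check, roughly two hashes' worth of work."""
    try:
        puzzle_id, answer, tag = key.split(":")
    except ValueError:
        return False
    expected = hmac.new(SERVER_SECRET, f"{puzzle_id}:{answer}".encode(),
                        hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(tag, expected)

def handle_first_request(http_request_line):
    """Decide what to do with the first data packet of a new connection."""
    prefix = "GET /validate?answer="
    if not http_request_line.startswith(prefix):
        # Not a validation attempt: reply with a fresh puzzle, then RST.
        return "SEND_PUZZLE_AND_RESET"
    key = http_request_line[len(prefix):].split()[0]
    if not key_is_valid(key):
        return "SEND_PUZZLE_AND_RESET"
    if active_conns_per_key.get(key, 0) >= MAX_CONNS_PER_KEY:
        # Limit connections per key so one solved puzzle cannot admit a herd.
        return "REJECT"
    active_conns_per_key[key] = active_conns_per_key.get(key, 0) + 1
    return "ESTABLISH_SOCKET"        # hand the request to the application
```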

    Mitigating the Performance Impact of Network Failures in Public Clouds

    Some faults in data center networks require hours to days to repair because they may need reboots, re-imaging, or manual work by technicians. To reduce traffic impact, cloud providers mitigate the effect of faults, for example, by steering traffic to alternate paths. The state of the art in automatic network mitigations uses simple safety checks and proxy metrics to determine mitigations. SWARM, the approach described in this paper, can pick orders-of-magnitude better mitigations by estimating end-to-end connection-level performance (CLP) metrics. At its core is a scalable CLP estimator that quickly ranks mitigations with high fidelity and, on failures observed at a large cloud provider, outperforms the state of the art by over 700× in some cases.
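
    A minimal sketch of ranking candidate mitigations by an estimated connection-level performance metric, in the spirit of the idea above (the estimator interface, metric, and safety check are assumptions, not SWARM's model):

```python
def rank_mitigations(candidates, topology, traffic, estimate_clp):
    """Score each candidate mitigation by the CLP it is predicted to yield
    (e.g. average flow completion time; lower is better), best-first."""
    scored = []
    for mitigation in candidates:
        if not passes_safety_checks(mitigation, topology):
            continue  # keep the basic guardrails existing systems rely on
        scored.append((estimate_clp(topology, traffic, mitigation), mitigation))
    return [m for _, m in sorted(scored, key=lambda pair: pair[0])]

def passes_safety_checks(mitigation, topology):
    """Placeholder for simple checks, e.g. enough residual capacity after
    taking a link or device out of service."""
    return True
```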